Storage system

ABSTRACT

A storage system and a method for storing a data segment, a storage capacity manager and a method for managing a capacity of a storage unit, and a storage tier relocation manager and a method for relocating a data segment. The storage system includes at least two storage tiers, an access pattern evaluator, a classification unit, a selector, and logic. The storage capacity manager includes a monitoring unit and a capacity managing unit. The storage tier relocation manager includes a target storage tier, the data segment relocated to the target storage tier, and a protection measure.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the benefit of priority from United Kingdom Patent Application No. 1415248.2, filed Aug. 28, 2014, the contents of which are incorporated herein by reference.

FIELD OF THE INVENTION

The present invention relates to a storage system. More particularly, the present invention relates to a method for storing a data segment in a storage tier of a storage unit including at least two storage tiers.

BACKGROUND

Today's multi-tiered storage systems are suited for offering a trade-off between high performance and efficient low-cost long-term storage of data. However, very limited intelligence is usually available to determine, without human intervention, in which tier a certain data file should be stored. While today's approach can be adequate for most applications given the number and size of data files that need to be stored and retrieved, it appears that a new paradigm is needed to address the challenges posed by applications where a very large amount of data is to be stored and valuable information is to be reliably identified and accessed. Examples of so-called big data applications are emerging in various fields, including social networks, sensor networks, and huge archives of business, scientific and government records. One of the critical big data challenges, however, is represented by a Square Kilometer Array (“SKA”) telescope, expected to be completed in 2024, whose antennas will gather tens of exabytes of data and store petabytes of data every day. Another significant big data challenge lies in the healthcare industry, where personalized medicine and large-scale cohort studies can require storage of medical data for extended periods of time.

SUMMARY OF THE INVENTION

The present invention provides a storage system including a storage unit including at least two storage tiers, and an access pattern evaluator configured to provide information about a frequency at which data segments stored in at least one of the at least two storage tiers are accessed; in a classification unit, at least one out of a set of at least two relevance classes is assigned to a data segment received for storing in the storage unit dependent on information included in the data segment; in a selector a storage tier out of the at least two storage tiers is determined for storing the classified data segment dependent on at least access frequency information provided by the access pattern evaluator for data segments in the same relevance class. In addition, a level of protection is determined in the selector for the classified data segment dependent on at least the relevance class assigned. Logic is provided for storing the classified data segment including the assigned relevance class to the determined storage tier and according to the determined level of protection.

The present invention also provides a method for storing a data segment in a storage tier of a storage unit including at least two storage tiers. At least one out of a set of at least two relevance classes is assigned to the data segment dependent on information included in the data segment. Information is received about a frequency at which data segments stored in at least one of the at least two storage tiers are accessed. A storage tier out of the at least two storage tiers is determined for storing the classified data segment dependent on at least the access frequency information received for data segments in the same relevance class. A level of protection is determined for the classified data segment dependent on at least the relevance class assigned. The classified data segment including the assigned relevance class is stored to the determined storage tier and according to the determined level of protection.

The present invention also provides a storage tier relocation manager for relocating a data segment presently stored in a storage tier of a storage unit with at least two storage tiers, the data segment having assigned a protection level out of a set of protection levels. The storage tier relocation manager is configured to determine a target storage tier for the data segment dependent on access frequency information received for the data segment or for a relevance class to which the data segment is assigned. The storage tier relocation manager is further configured to relocate the data segment to the target storage tier if the target storage tier is different from the present storage tier, and in this case to apply a protection measure suitable for achieving the assigned protection level.

The present invention also provides a method for relocating a data segment presently stored in a storage tier of a storage unit with at least two storage tiers, the data segment having assigned a protection level out of a set of protection levels. A target storage tier is determined for the data segment dependent on access frequency information received for the data segment or for a relevance class to which the data segment is assigned. The data segment is relocated to the target storage tier if the target storage tier is different from the present storage tier. In this case a protection measure is applied suitable for at least achieving the assigned protection level.

The present invention also provides a storage capacity manager for a storage system including a storage unit for storing data segments having at least one relevance class assigned per data segment. The storage capacity manager includes a monitoring unit for determining if a utilization of the storage unit fulfils a criterion, and a capacity managing unit for, in response to the utilization of the storage unit fulfilling the criterion, selecting at least one data segment stored in the storage unit for one of a deletion thereof or a deletion of a copy thereof at least dependent on the relevance class assigned.

The present invention also provides a method for managing a capacity of a storage unit for storing data segments having at least one relevance class out of a set of at least two relevance classes assigned per data segment. It is determined if a utilization of the storage unit fulfils a criterion. In response to the utilization of the storage unit fulfilling the criterion, at least one data segment stored in the storage unit is selected for one of a deletion thereof or a deletion of a copy thereof at least dependent on the relevance class assigned.
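The capacity management method can be illustrated by a minimal sketch. It is not taken from the description: the names `StoredSegment` and `select_for_deletion`, the utilization threshold used as the criterion, and the convention that a higher number means a more relevant class are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StoredSegment:
    name: str
    size: int       # space occupied in the storage unit (arbitrary units)
    relevance: int  # assigned relevance class; higher = more relevant (assumed)


def select_for_deletion(segments, capacity, threshold=0.9):
    """Return names of segments selected for deletion, least relevant first.

    Assumed criterion: utilization above `threshold`. Segments are selected
    until utilization is back at or below the threshold.
    """
    used = sum(s.size for s in segments)
    selected = []
    for seg in sorted(segments, key=lambda s: s.relevance):
        if used <= threshold * capacity:
            break
        selected.append(seg.name)
        used -= seg.size
    return selected
```

In this sketch, selection depends only on the relevance class; a fuller policy could equally delete a redundant copy instead of the segment itself, as the method allows.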

It is understood that method steps can be executed in a different order than listed in a method claim. Such different order shall also be included in the scope of such claim as is the order of steps as presently listed.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention and its embodiments will be more fully appreciated by reference to the following detailed description of presently preferred but nonetheless illustrative embodiments in accordance with the present invention when taken in conjunction with the accompanying drawings.

FIG. 1, a block diagram of a storage system according to an embodiment of the present invention;

FIG. 2, a flowchart of a method for storing a data segment in a storage tier of a storage unit including at least two storage tiers, according to an embodiment of the present invention;

FIG. 3, a flowchart of a method for relocating a data segment presently stored in a storage tier of a storage unit, according to an embodiment of the present invention; and

FIG. 4, a flowchart of a method for managing a capacity of a storage unit for storing data segments, according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

As an introduction to the following description, general aspects of the invention are first pointed out.

A storage system is understood as a tiered storage system once it includes multiple tiers of storage. Different storage tiers preferably are embodied as storage devices of different technology, including but not limited to tape storage technology, hard disk drive (HDD) storage technology, solid state drive (SSD) storage technology such as flash storage, etc. The storage devices can offer different characteristics per storage tier, which can e.g. include storage volume, reliability e.g. measured in form of bit error rates, performance including access time, cost, term of storage, etc., such that when combining different storage technologies into a tiered storage system, considerable advantages can be achieved given that storage devices with different characteristics can be selected for storing data segments subject to the needs of the different data segments to be stored. It is preferred that in a multi-tiered storage system each storage tier includes only one type of storage device. In a different embodiment, different tiers of a tiered storage can be based on the same storage technology. However, they can still show different characteristics owing to the usage of storage devices of different generations or of electrical connections of different quality, such that these storage devices differ in at least one characteristic that can be relevant for storing data segments. This can impact the decision as to the kind of storage device on which a data segment is placed.

Generally, the different storage tiers need not necessarily reside at a common location or in a common housing but can be distributed as long as the classification unit to be introduced later on has access to the storage tiers and can store data segments to and retrieve data segments from the various storage tiers. Each storage tier as such can contain at least one physical device. For example, an HDD storage tier can contain up to hundreds of HDDs, or in a different embodiment, only a single HDD.

Data segments shall include any unit of data to be stored and in case of a tiered storage system, any unit of data that can be a subject of an individual decision as to where in the tiered storage it is desired to be stored. Data segments can include at least one block, page, segment, file, objects file, portions of a data stream, etc.

An access pattern evaluator is configured to monitor accesses to the data segments stored in the storage unit, i.e. in at least one and preferably all of the various storage tiers. Accesses can in particular encompass read and/or write operations on a data segment, e.g. when a user of the storage system reads data segments from the tiered storage. The access pattern evaluator is configured to output such access patterns in form of access frequencies for the data segments. In a preferred embodiment, the access frequencies are not individually monitored and supplied, but access frequencies are provided with respect to relevance classes into which the stored data segments are classified as will be explained later on. The access pattern evaluator can be embodied in one of hardware, software or a combination thereof.

It is noted that the access pattern evaluator provides statistical data in form of an access frequency. However, any other statistical data referring to accesses of stored data segments is meant to be subsumed under the term access frequency. The access pattern evaluator provides access frequencies for stored data segments, and as such evaluates a popularity of the data segments since the access frequency can be regarded as a measure of the popularity of a data segment or a relevance class into which the data segment is classified. The higher the access frequency the more popular the data segment or the corresponding relevance class is.

The storage system further includes a selector. The selector can be implemented in hardware, software or a combination of both. For every data segment to decide on, the selector receives, possibly amongst others, the relevance class assigned to the data segment by the classification unit and the access frequency information provided by the access pattern evaluator for data segments, and preferably for the particular relevance class to which the present data segment is assigned. Based on this information, i.e. the relevance class and the access frequency information for this relevance class, the selector determines a level of protection the data segment is to be stored with, and a storage tier in which the data segment is to be stored.

Hence, the selector takes its decision at least based on the content of the data segment, whose content is mapped into a relevance class as explained above. In one embodiment, a data segment with a high-rank relevance class is associated with a high value. Accordingly, its content is such that a loss would be associated with a high cost. Therefore, this data segment deserves a higher level of protection compared to data segments with lower value.

A classification unit as used in the context of the present invention is configured to classify a data segment that is requested to be stored in the storage unit into a relevance class. In one embodiment, two or more relevance classes can be assigned to a data segment, although the description mostly refers to one relevance class being assigned. A set of relevance classes from which a relevance class is selected for assignment preferably includes at least two relevance classes. The classification of a data segment into a relevance class preferably is based on information included in the data segment to be classified; this information is also referred to as the content of the data segment. Hence, it is the content of the data segment that is evaluated for performing the classification. However, in another embodiment, information included in other data segments, e.g. data segments that are linked in time or space to the data segment to be classified, can also be evaluated for assigning a relevance class to the data segment.

The classification unit can take different embodiments. Subject to the complexity of information in the data segments and the number of data segments arriving for storage, it can be preferred that an event detector is provided. Such an event detector can evaluate the data segments to be classified for an occurrence of at least one pre-defined event. An event detector can evaluate a data segment on its own, or multiple data segments in combination. In an example of data segments representing images supplied by a telescope, an event can be considered as the occurrence of an astronomical event such as the occurrence of a planet in the image. A classifier can then classify the event in more detail, such as by at least one of size, shape, color, etc. In this respect, the event detector can also be understood as a pre-classifier which limits the number of relevance classes available for this particular event down to a subset. The subsequent classifier then can only assign at least one relevance class of this subset.

In embodiments, the storage system can include at least one of the following features: the classification unit includes a set of event detectors with each event detector of the set being configured to detect a different event in the data segment received for storing; the classification unit includes a set of classifiers, each classifier being assigned to a different one of the event detectors, and each classifier of the set being configured to assign at least one relevance class pre-selected from the set of relevance classes for the event to be detected by the assigned event detector; a real-time data processing unit for real-time processing of input data segments and providing a sequence of data segments for storing; the selector is configured to determine the storage tier for the classified data segment dependent on the access frequency information provided by the access pattern evaluator for data segments in the same relevance class, and dependent on the relevance class assigned; the selector is configured to determine the level of protection for the classified data segment dependent on the relevance class assigned and dependent on the access frequency information provided by the access pattern evaluator for data segments in the same relevance class; the selector is configured to determine a redundancy level for the classified data segment dependent on the determined protection level and the determined storage tier, the redundancy level specifying a number of copies of the classified data segment to be stored in the storage unit, and in particular specifying how many copies of the classified data segment are to be stored in which storage tier of the storage unit; the logic is configured to store copies of the classified data segment in at least one storage tier according to the determined level of redundancy; the selector is configured to determine at least one of an error correction code and an erasure code to be applied to the classified data segment dependent on the determined protection level and the determined storage tier; and the logic is configured to store the classified data segment with the at least one of the determined error correction code and erasure code in the storage unit.

In another embodiment, multiple event detectors are provided and specifically each event detector is configured to detect a specific event that is different from the events the other event detectors are configured to detect. This arrangement is preferred in case parallel processing is required for big data applications. In such an embodiment, the classifier can be responsible for further classifying the detected events. However, it is preferred that one classifier is assigned to each event detector such that the number of classifiers corresponds to the number of event detectors. The classification can then also be parallelized. In another embodiment, multiple classifiers can be provided in combination with only a single event detector configured to detect multiple different events. In this case, the number of classifiers can correspond to the number of events that can be detected by the single event detector.
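The pairing of one classifier per event detector can be sketched as follows. This is a hypothetical illustration only: the dictionary-based segment representation, the names `detect_planet`, `classify_planet`, `PIPELINE`, and `classify`, and the class labels are assumptions and do not appear in the description.

```python
from typing import Callable, Dict, List, Tuple

# A hypothetical detector reports whether its event occurs in a segment.
Detector = Callable[[dict], bool]
# A hypothetical classifier picks one class from its pre-selected subset.
Classifier = Callable[[dict], str]


def detect_planet(segment: dict) -> bool:
    return segment.get("object") == "planet"


def classify_planet(segment: dict) -> str:
    # Subset of relevance classes reserved for the "planet" event (assumed).
    return "planet/large" if segment.get("size", 0) > 10 else "planet/small"


# One classifier is assigned to each event detector.
PIPELINE: Dict[str, Tuple[Detector, Classifier]] = {
    "planet": (detect_planet, classify_planet),
}


def classify(segment: dict) -> List[str]:
    """Run every detector; apply the paired classifier where an event fires."""
    classes = []
    for detector, classifier in PIPELINE.values():
        if detector(segment):
            classes.append(classifier(segment))
    return classes
```

Because each detector and classifier pair is independent of the others, the loop body could be distributed over parallel workers, which is the motivation given above for this arrangement.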

In another embodiment, the two-step event detection and classification process can be replaced by a single classification step in which the data segments, which are input to the storage system, are evaluated versus the complete set of relevance classes. In the exemplary astronomical application, rather than looking for an astronomical event first and then classifying this event in more detail, the classification can be applied without prior event detection. Either way can result in the very same assignment of relevance class/es, e.g. “planet of size x and color y”. In a different embodiment, the classification unit can solely include an event detector which at the same time acts as a classifier, specifically when at least one dedicated class is assigned to an event a priori. In a different view, the event can be known a priori, and only event features are to be identified, in which case an event detector is not needed.

The set of relevance classes available for tagging a data segment can be defined up-front and can be fixed and limited in size, or can change dynamically during operation of the storage system and/or during operation of a user application making use of the storage system. In one embodiment, self-learning algorithms are applied for changing and/or refining the set of relevance classes. For the overall storage system, it is envisaged that at least two relevance classes are provided and available for tagging the data segments. Subject to the complexity of the user application, hundreds of relevance classes can be available. In the case of a use of at least one event detector, the corresponding subsets of relevance classes that are assignable for a particular event are defined up-front. A subset can at minimum contain one relevance class in case the event is sufficiently defined by such relevance class.

Each relevance class can imply a certain relevance of the data segments being classified thereto, wherein some relevance classes can refer to data segments with a content considered more important than the content of data segments assigned to other relevance classes. However, in a preferred embodiment, a relevance class can in the first instance solely represent a description of the content of the subject data segment, such as in the above example “planet of size x and color y”. Here, the relevance class assigned can rather be regarded as a descriptor for the content of the subject data segment. A relevance for the content described by the descriptor can at least later be added, e.g. by ranking the descriptors in order of relevance for storage purposes. Hence, the assignment of a relevance class to a data segment can in one embodiment include a mapping of descriptors to relevance classes, and e.g. also include a mapping of multiple different descriptors to a common relevance class. Finally, it is desired that a classification is applied that at least to some extent assigns a metric to a data segment reflecting an importance of the content of the data segment for storage purposes.

The classification unit assigns at least one relevance class to a data segment to be stored, which is considered to be equivalent to assigning the data segment to at least one relevance class. Preferably, only one relevance class is assigned per data segment. It is preferred that all data segments requested to be stored are classified and labelled by at least one relevance class. However, there can be envisaged a processing unit that preprocesses data segments arriving at the storage system. Such preprocessing can in one embodiment already lead to a selection of data segments to be stored out of all arriving data segments. In one embodiment, the processing unit is a real-time data processing unit for real-time processing of arriving data segments, also referred to as input data segments, e.g. in form of an input data stream, where the real-time data processing unit supplies a sequence of data segments to be stored, which are subsequently classified. Specifically, such a processing unit can apply at least one filtering operation, suppression of spurious data segments, removing interference data segments, etc.

Under the assumption that the data segments are provided to the storage system as an input data stream, it is preferred that at least one buffer be provided in the classification unit in order to temporarily buffer the incoming and/or pre-processed data segments for providing sufficient time for conducting the classification and the determination of protection level and storage tier as will be explained later on. Hence, in one embodiment, a buffer is provided for buffering a data segment received for storage for at least a period of time required by the classification unit for assigning a relevance class to this data segment. In case the classification is implemented by multiple classifiers, one buffer can be provided per classifier, or a common buffer can be provided for more or all classifiers. In addition, or independent from the above buffer/s, another buffer preferably is provided for buffering a data segment received for storage for at least a period of time required by the selector for determining the storage tier and the protection level. After the determination, the selector can forward the data segment to be stored together with the class information and information as to the determined protection level and information as to the suggested storage tier to the storage unit, and in particular its logic.

In a preferred embodiment, protection of a data segment in the present context can be oriented along the following types of impairment categories that a data segment can incur:

(a) data corruption where bits are altered,
(b) data erasure where bits are lost, and
(c) temporary data unavailability.

The corresponding metrics for these types of impairments can include: for type (a) impairments, a bit error rate metric; for type (b) impairments, a mean time to data loss (“MTTDL”) metric or a mean annual amount of data lost (“MAADL”) metric; for type (c) impairments, a percentage of time during which a data segment is unavailable.

These metrics again can be implemented by protection measures including at least one of the following: for type (a) impairments, a required bit error rate metric can be achieved by applying an error correction code of a given correction power to the data segment; for type (b) impairments, a MTTDL metric or MAADL metric can be achieved by applying an erasure code of a given correction power to the data segment; and for type (c) impairments, a percentage of time during which a data segment is unavailable can be limited by providing copies of the data segment in the storage unit, also referred to as applying a redundancy level.

The protection level to be assigned can be selected from a set of protection levels available. In a preferred embodiment, each protection level is defined by a combination of individual impairment levels not to be underrun in the various impairment categories. The protection level then is achieved by a protection measure that addresses at least one individually allowed impairment level by at least one corresponding measure or a combination thereof, selected from: selecting a suitable redundancy level for the data segment; selecting a suitable error correction code for the data segment; and selecting a suitable erasure code for the data segment. The determination of the redundancy level can in one embodiment specify the number of copies of the data segment that are to be stored in at least one of the storage tiers.

Through the monitoring of data segment accesses by the access pattern evaluator it can be determined which relevance classes or which corresponding data segments are more popular than others. Every time a data segment is accessed, the associated metadata information preferably including the relevance class is provided to the access pattern evaluator, which learns about a popularity of the information content in the data segments from the way they are being accessed. Access patterns can be found at various levels including at least one of activity during various times of a day, sequence of reads and writes, access sequentiality, or number of users retrieving the data. This information is preferably used to further classify data segments into one of several popularity classes and shall also be subsumed under the access frequency information.

Any time the access frequency information changes, e.g. the popularity class changes, such a change can be sent by the access pattern evaluator to the selector, which accordingly can update a metric for an initial decision on a level of protection and a storage tier of individual input data segments. Therefore, the selector determines the data segment placement in the tiered storage and the level of protection based on both a data relevance classification and data access statistics. In this manner, a data segment that belongs to a certain relevance class is passed on to a suitable storage tier and is protected by means at least achieving the required protection level, e.g. by at least one of applying an error correction code, applying an erasure code, and applying a redundancy level that are most appropriate at a particular point in time.

In an embodiment, the selector assigns a protection level and a storage tier placement to an incoming data segment $d_n$ at time $nT$, where $1/T$ is the rate at which data segments are received, based on metrics that depend on two variables, named “relevance index” $i_r(c_k)$ and “popularity index” $i_{p,n}(c_k)$, where $c_k$ indicates the relevance class such that $d_n \in c_k$. Both $i_r(c_k)$ and $i_{p,n}(c_k)$ are real valued in the interval $[0, 1]$. Hence, it is apparent that classes are not necessarily restricted to discrete levels but also can be represented by real values as allowed. Note that the cardinality of the set of relevance classes is given by $(K_1+1) \times (K_2+1) \times \dots \times (K_N+1)$, where $K_l$ denotes the number of classes of the $l$-th classifier, $l = 1, 2, \dots, N$. The relevance index corresponds to the importance of the relevance class as identified by the $N$ classifiers, whereas the popularity index corresponds to the popularity of the relevance class as determined by the access pattern evaluator. The popularity index of each class varies over time depending on the access pattern, whereas the relevance index varies slowly compared to the popularity index, as a result of a varying assessment of the relevance of a class. It is assumed that at each time interval in which a new data segment is received, sufficient capacity is available at each storage tier for a new data segment allocation. The relevance class of a new data segment at the $n$-th time interval, denoted by $d_n$, is assigned by a classifier, or by the classification unit as such, and the relevance index is given by $i_r(c_k)$, where $d_n \in c_k$. As the data segment $d_n$ is new to the system, its popularity class is ideally chosen as the most likely popularity class given that it belongs to relevance class $c_k$, or can be assigned manually by an administrator or user.

An estimate of the most likely popularity index for a data segment that belongs to a certain relevance class can be obtained by updating at each time interval the popularity index estimate for each relevance class as $i_{p,n}(c_k) = \max(i_{p,n-1}(c_k) - \varepsilon_0, 0)$, if no data segment of class $c_k$ is retrieved at the $(n-1)$-th time interval, or $i_{p,n}(c_k) = \min(1, i_{p,n-1}(c_k) + \varepsilon_1)$ otherwise, where $\varepsilon_0$ and $\varepsilon_1$ are constant parameters.
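The update rule for the popularity index estimate can be written as a short function. This is a minimal sketch; the function name `update_popularity` and the default values of the two constant parameters are assumptions chosen for illustration, since the description does not specify them.

```python
def update_popularity(previous: float, retrieved: bool,
                      eps0: float = 0.01, eps1: float = 0.05) -> float:
    """One time-interval update of the popularity index of a relevance class.

    previous  -- popularity index at the preceding time interval, in [0, 1]
    retrieved -- True if a segment of the class was retrieved in that interval
    eps0/eps1 -- decay and growth constants (illustrative values, assumed)
    """
    if retrieved:
        # Popularity grows on access, saturating at 1.
        return min(1.0, previous + eps1)
    # Popularity decays without access, never dropping below 0.
    return max(previous - eps0, 0.0)
```

Choosing the growth constant larger than the decay constant, as in these defaults, makes a class gain popularity quickly on access and lose it slowly; this is a design choice of the sketch, not a requirement stated in the description.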

In the absence of data access statistics, e.g. at initialization of a storage system, a correspondence between classes and storage tiers can be initially assumed, i.e. the higher the relevance class, the higher the storage tier, wherein a hierarchy of the storage tiers is applied according to a single one or a combination of characteristics of the different storage tiers. E.g., a storage tier can be higher in the tier hierarchy if it provides faster access times, etc. However, some time after initialization of the storage system, additional information about the popularity of the data segments associated with a certain class is generated due to data retrieval activity and this can impact the selection of the storage tier, e.g. the higher the popularity of a relevance class, the higher the storage tier to which a new data segment in this class is assigned. Again, the storage tier can be regarded as superior in the tier hierarchy if it provides faster access times, for example.

In a preferred embodiment, an assignment of a tier placement $T(d_n)$ follows:

(1) $T(d_n) = f_t(i_r(c_k), i_{p,n}(c_k))$, and specifically $T(d_n) = f_t(\rho\, i_r(c_k) + \sigma\, i_{p,n}(c_k))$,

and an assignment of a protection level $Q(d_n)$ follows:

(2) $Q(d_n) = f_q(i_r(c_k), i_{p,n}(c_k))$.

In a preferred embodiment, a redundancy level $U(d_n)$ is assigned to a data segment $d_n$ as follows:

(3) $U(d_n) = f_u(Q(d_n), T(d_n))$,

where $f_t$ and $f_q$ are functions that univocally map a metric value to a tier level and to a protection level, respectively, $\rho$ and $\sigma$ are given system parameters, and $f_u$ is a function that maps a tier and a protection level to a redundancy level. Hence, the determination of both a storage tier and a protection level for a data segment is preferably dependent on both the relevance index and the popularity index.
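The three assignments can be sketched in code. The description leaves the mapping functions unspecified, so everything concrete below is an assumption: the function names, the uniform quantization used for the tier and protection mappings, the weights of 0.5, the number of tiers and levels, and the rule that adds a copy on tiers listed as less reliable.

```python
def tier_placement(i_r, i_p, rho=0.5, sigma=0.5, num_tiers=3):
    """Tier from the weighted metric rho*i_r + sigma*i_p (quantization assumed)."""
    metric = rho * i_r + sigma * i_p
    # With rho + sigma = 1 the metric stays in [0, 1]; tier 0 is the slowest.
    return min(int(metric * num_tiers), num_tiers - 1)


def protection_level(i_r, i_p, num_levels=3):
    """Protection level; this sketch uses the relevance index only (assumed)."""
    return min(int(i_r * num_levels), num_levels - 1)


def redundancy_level(q, t, less_reliable_tiers=(0,)):
    """Number of copies from protection level q and tier t (rule assumed)."""
    copies = 1 + q
    if t in less_reliable_tiers:
        copies += 1  # compensate for a tier assumed to be less reliable
    return copies
```

For example, a highly relevant but rarely accessed segment with a relevance index of 0.9 and a popularity index of 0.1 lands in the middle tier with the highest protection level under these assumed mappings.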

TABLE I

Level             | Associated with | Determined by
Protection Level  | Data segment    | Relevance (or importance) of data segment
Reliability Level | Tier            | Failure characteristics of the devices in the tier
Redundancy Level  | Data segment    | Protection level of the data segment and the reliability level of the tier(s) in which it is stored

Table I illustrates the dependencies of the various levels. While in this embodiment the protection level is solely dependent on the relevance class, the redundancy level, as the sole or one of several protection measures for implementing the assigned protection level, is dependent on this very protection level assigned as well as on the determined storage tier. For quantifying the redundancy level, the selected storage tier is preferably represented by its reliability, which can be classified into a reliability level out of a set of reliability levels given that each type of storage device differs in particular in reliability, e.g. expressed by a bit error rate. E.g., the bit error rate of tape is currently in the order of 1e-19, whereas that of HDDs is in the order of 1e-15.

Preferably, it is assumed that any storage tier selection is inherently dependent not only on the parameters assigned to the data segment to be stored, but also on the specifics of the storage tier, which in one embodiment can be represented by the reliability level into which its bit-error rate can be classified.

The above equations (1) and (2) preferably implement at least one of the following characteristics:

a) The more relevant (or important) a data segment is, e.g. expressed by its associated relevance index, the higher its assigned level of protection;
b) The more popular (or frequently accessed) a data segment is, e.g. expressed by its associated popularity index, the faster the access it requires, i.e. the faster the storage tier that is selected for storage.

The following Table II illustrates an assignment of a storage tier and a protection level to a data segment according to this embodiment of the present invention:

TABLE II

| | Relevant data | Less important data |
| Popular data | High level of protection; faster tier | Low level of protection; faster tier |
| Infrequently accessed data | High level of protection; slower tier | Low level of protection; slower tier |

A protection level can be implemented by applying at least one defined error correction code to the data segment, applying an erasure code across devices (such as RAID for HDDs), or storing the data segment a number of times in the same or in different tiers for providing redundancy. A combination of the means applied is also referred to as a protection scheme or protection measure.

The following Table III illustrates an assignment of a storage tier and a protection level according to an embodiment of the present invention, wherein the protection level of a data segment is determined by the relevance class assigned, and wherein the storage tier is selected e.g. dependent on the access frequency information for the subject relevance class. The less frequently the data segment is accessed, the lower the storage tier in which the data segment is stored. However, a lower storage tier can not only be slower in access time but also be less reliable. The data segment can a priori be assigned to a less reliable storage tier in view of more preferred storage tiers already being occupied. The requested protection level can still be achieved by determining a suitable redundancy level. According to Table III, relevant data segments that require a high level of protection can therefore be stored on a less reliable storage tier, but in multiple copies in this storage tier, thereby providing a high level of redundancy. Alternatively, the data segments requiring a high level of protection can be stored in a more reliable storage tier, which requires only a moderate number of copies in this storage tier, i.e. a moderate level of redundancy. In a third alternative, multiple copies of such data segments can be stored across multiple tiers.

TABLE III

| | Relevant data | Less important data |
| Less reliable tier | High level of protection; high level of redundancy | Low level of protection; moderate level of redundancy |
| More reliable tier | High level of protection; moderate level of redundancy | Low level of protection; low level of redundancy |
| Multiple tiers | High level of protection; multiple copies stored across tiers | Low level of protection; fewer copies stored across tiers |
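The relationship in Table III between protection level, tier reliability, and redundancy level can be made concrete with a small sketch of a function fu. It assumes, as a simplification that the text does not prescribe, that redundancy is realized by plain replication, that copies are lost independently, and that the protection level is expressed as a maximum tolerated probability of losing all copies. The per-copy loss probabilities and targets below are hypothetical example values, not measured device characteristics.

```python
def copies_needed(per_copy_loss_prob, target_loss_prob, max_copies=10):
    """Smallest number of independent copies whose joint loss probability
    does not exceed the target (a sketch of a redundancy function fu)."""
    copies = 1
    while per_copy_loss_prob ** copies > target_loss_prob:
        copies += 1
        if copies > max_copies:
            raise ValueError("target not reachable within max_copies")
    return copies


# Hypothetical reliability levels of two tiers and two protection targets.
LESS_RELIABLE, MORE_RELIABLE = 1e-2, 1e-3
HIGH_PROTECTION, LOW_PROTECTION = 1e-5, 2e-3

for tier in (LESS_RELIABLE, MORE_RELIABLE):
    for target in (HIGH_PROTECTION, LOW_PROTECTION):
        print(tier, target, copies_needed(tier, target))
```

Under these assumed numbers the pattern of Table III is reproduced: the same protection target requires more copies on the less reliable tier than on the more reliable one.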

After the various determinations, the selector can forward the data segment to be stored, together with the relevance class information, the protection level information and information as to the suggested storage tier, to the storage unit, and in particular to its logic. In a preferred embodiment, the required protection measure is also already determined by the selector and submitted to the logic.

Logic is provided for storing the data segment in the determined storage tier and for implementing the determined level of protection. The protection level can therefore in one embodiment be translated into a protection measure including at least one of: storing a number of copies, also referred to as redundancy level; selecting an error correction code; or selecting an erasure code. Alternatively, if the protection measures are already determined by the selector, the logic can apply these protection measures. The logic can be implemented in hardware, software or a combination of both and is meant to be the entity executing the suggestion made by the selector.

A data segment finally stored in the assigned storage tier is preferably stored together with the assigned relevance class and the assigned protection level. These levels can be stored in combination with other metadata for the specific data segment.

According to a preferred embodiment of the storage system, however, also as an aspect independent from the previously introduced embodiments of the storage system and the corresponding storage unit, a storage relocation manager is introduced. In a dynamic storage system, the popularity of each data segment as well as its relevance, although to a lesser extent, can change over time. Hence, a unit referred to as storage relocation manager can be in charge of moving data segments to other storage tiers of the storage unit, also referred to as target storage tiers. For example, when the popularity of a data segment increases, it can be desirable to move it from a present slow storage tier to a faster storage tier to enable quicker access. When the popularity of a data segment decreases, it can be desirable to move it from a fast present storage tier to a slower storage tier to free up space for other popular data segments. However, any movement solely based on the popularity index can have an impact on the protection level, too, e.g. when the target storage tier has a different reliability than the present storage tier. The same is true when a data segment is replicated across multiple tiers.

In a preferred embodiment, the storage tier relocation manager is configured, in case of a relocation, to apply the protection measure by selecting at least one of: a redundancy level specifying a number of copies of the data segment to be stored in at least one of the target storage tier and a non-target storage tier; an error correction code to be applied to the data segment; and an erasure code to be applied to the data segment. The storage tier relocation manager is further configured to store the data segment in the storage unit according to the selected at least one of the redundancy level, the error correction code and the erasure code.

In a preferred embodiment, the storage relocation unit, which is also referred to as the migrator, receives information from an access pattern evaluator such as described above and as such receives access frequency information for the individual relevance classes. This access frequency information enables the migrator to place data segments in the right storage tier to enhance access performance. Specifically, the migrator can move data segments stored in a present storage tier to another storage tier if such movement is indicated by the present access patterns of such data segments, and specifically by the access patterns of the class to which the respective data segment belongs. In another embodiment, the migrator can in addition monitor a relevance class assigned to the data segment and, specifically, a change in such relevance class, which can also lead to a relocation of the data segment to a different storage tier.

It can be desirable that more relevant and popular data segments receive a higher level of protection. To ensure a certain protection level in a given storage tier, a protection scheme is employed, which is understood as a combination of protection measures to implement the desired protection level. The protection scheme can entail a combination of error correction codes within devices, e.g. for type (a) impairments, erasure codes across devices, e.g. for type (b) impairments, and replication across devices, e.g. for type (c) impairments, as laid out above. However, when a different access frequency than in the past is observed, which can suggest moving a data segment to a different storage tier, i.e. the target storage tier, the protection level in the target storage tier can be different from that of the present storage tier. If, on the other hand, the relevance of the data segment has not changed, the protection scheme preferably is to be amended. This is because different storage tiers exhibit different levels of reliability, e.g., the bit-error rate of tape is 1e-19 whereas that of HDDs is 1e-15. Consequently, when moving data segments from one storage tier to another, the migrator preferably adapts the applied protection scheme in order to maintain the same protection level, e.g., by at least one of changing between 2-way and 3-way replication, and applying error correction and/or erasure codes with a different number of parities.

For each data segment dl, l=1, . . . , L, stored in the storage unit, an access pattern evaluator such as the one described above preferably assigns a popularity class c′j and an associated popularity index ip(c′j), which are determined by the number of accesses and the amounts of data read from and written to each data segment in the recent history of time period T1. The popularity class of each data segment is periodically sent by the access pattern evaluator to the migrator with time period T2. The migrator then uses this information along with the relevance index ir(ck) of each data segment to determine a target tier Tn(dl), the new protection level Qn(dl), and the new redundancy level Un(dl) for that data segment for the time period nT2 to (n+1)T2 using expressions similar to (1), (2), and (3), e.g.:

Tn(dl)=ft(ir(ck),ip(c′j),C1, . . . ,CM,P1, . . . ,PM),  (4)

Qn(dl)=fq(ir(ck),ip(c′j)),  (5)

Un(dl)=fu(Qn(dl),Tn(dl),R1, . . . ,RM),  (6)

wherein ft and fq are functions, e.g., linear, that univocally map a metric value to a storage tier and to a protection level, respectively, and fu is a function that maps a tier and a protection level to a redundancy level. Here, C1, . . . , CM are the costs per gigabyte, P1, . . . , PM are the device power consumptions, and R1, . . . , RM are the reliability indices, which are metrics for the levels of reliability, for each of the M tiers.
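One possible reading of expression (4) is sketched below in Python. The text does not specify how the costs C1, . . . , CM and power consumptions P1, . . . , PM enter ft, so the scoring rule used here (distance between a tier's speed and the segment's weighted demand, plus weighted cost and power penalties) is purely an assumption, as are the `Tier` fields, the weights, and the example tier values. Expressions (5) and (6) are not covered by this sketch.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    speed: float        # hypothetical normalized access speed in [0, 1]
    cost_per_gb: float  # C_m, hypothetical cost per gigabyte
    power: float        # P_m, hypothetical normalized power consumption


def target_tier(ir, ip, tiers, rho=0.5, sigma=0.5,
                cost_weight=0.0, power_weight=0.0):
    """Pick the tier whose speed best matches the demand rho*ir + sigma*ip,
    penalized by cost and power (an assumed form of ft in expression (4))."""
    demand = rho * ir + sigma * ip

    def score(tier):
        return (abs(tier.speed - demand)
                + cost_weight * tier.cost_per_gb
                + power_weight * tier.power)

    return min(tiers, key=score)


TIERS = [
    Tier("tape", speed=0.1, cost_per_gb=0.01, power=0.05),
    Tier("hdd", speed=0.5, cost_per_gb=0.03, power=0.5),
    Tier("ssd", speed=0.9, cost_per_gb=0.20, power=0.3),
]

# Without a cost penalty the segment goes to the fast tier; with one, to HDD.
print(target_tier(1.0, 0.5, TIERS).name)
print(target_tier(1.0, 0.5, TIERS, cost_weight=1.0).name)
```

A migrator built along these lines would re-evaluate `target_tier` once per period T2 for each stored segment and relocate the segment only when the result differs from its present tier.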

In one embodiment, for cost reduction, a certain protection level can be guaranteed by placing copies of data segments across multiple storage tiers. For example, a data segment with a high relevance index and a low to moderate popularity index can have one replica on an HDD storage tier for performance purposes, and another replica on a tape storage tier for reliability and cost purposes. It is known that erasure codes can provide much higher storage efficiency than replication for the same level of reliability. On the other hand, erasure codes can suffer from reduced access performance. Therefore, depending on the relevance and popularity indices, a choice can be made between an erasure code and replication based on the trade-off between storage efficiency and performance.
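The storage-efficiency side of this trade-off can be shown with a short worked comparison. The sketch assumes a maximum-distance-separable (n, k) erasure code such as a Reed-Solomon code, which tolerates the loss of any n-k of its n fragments; the specific parameters (3-way replication versus a (6, 4) code) are chosen only as an example.

```python
def replication(copies):
    """Return (storage overhead factor, tolerated device losses) for r-way replication."""
    return float(copies), copies - 1


def mds_erasure_code(n, k):
    """Return (storage overhead factor, tolerated device losses) for an
    (n, k) maximum-distance-separable erasure code."""
    return n / k, n - k


# Both schemes survive the loss of any two devices, at different storage cost.
print(replication(3))          # overhead 3.0, tolerates 2 losses
print(mds_erasure_code(6, 4))  # overhead 1.5, tolerates 2 losses
```

The erasure code halves the raw capacity needed for the same loss tolerance, but reading a segment with a missing fragment requires decoding from k other fragments, which is the access-performance penalty mentioned above.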

As described in connection with the selector, it is preferred that Tables I, II and III also apply in the migration of already stored data segments, preferably in connection with the level of protection being specified for data segments in terms of certain metrics, e.g., MTTDL, availability, delay, etc., which can be associated with relevance classes and preferably popularity; in connection with the level of device reliability being specified in some metric (MTTF, . . . ) such as a failure or error characteristic of the storage devices/tiers; and in connection with the level of redundancy specifying parameters of an underlying redundancy scheme. The levels of protection for the data segments and the levels of device reliability for the devices used within each tier are preferably known prior to a data segment replacement. The levels of redundancy are preferably determined such that the protection level for each data segment is guaranteed when the data segment is placed in a target tier.

The process introduced above is also referred to as dynamic tiering and can typically occur over large time scales compared to a time interval T over which a data segment is received for storage. The policies according to which data is moved across different storage tiers, and hence different types of storage devices, depend on access pattern characteristics and in addition preferably on the assigned relevance class. Depending on the storage device performance characteristics, certain tiering strategies can be better for a given workload than others. For instance, data segments accessed sequentially are preferably placed on HDDs, whereas randomly accessed data are preferably placed on SSDs. Also, it is conceivable that the updated information regarding the popularity of the data segments associated with the various relevance classes in one embodiment is used to determine subsequent data segment movements. This, in turn, can steer the employment of effective caching and tiering strategies that have a significant effect on the cost and performance of the storage system.

In embodiments, the storage capacity manager can include at least one of the following features:
the storage capacity manager is configured to manage a storage unit including at least two storage tiers;
the monitoring unit is configured to determine if the utilization of at least one of the at least two storage tiers fulfils the criterion;
the capacity managing unit is configured to select, in response to the utilization of at least one storage tier fulfilling the criterion, the at least one data segment stored in this storage tier for one of a deletion thereof or a deletion of a copy thereof in the same storage tier or a different storage tier;
the monitoring unit is configured to determine if the utilization of the storage unit falls below a capacity threshold;
the capacity managing unit is configured to select, in response to the utilization of the storage unit falling below the capacity threshold, the at least one data segment;
the capacity managing unit is configured to one of: suggest the at least one selected data segment for deletion or deletion of a copy thereof; suggest the at least one selected data segment for deletion or deletion of a copy thereof and delete the at least one selected data segment or a copy thereof in response to a user confirmation; delete the at least one selected data segment or copies thereof;
the capacity managing unit is configured to assign a retention class out of a set of retention classes to the data segments in the storage unit, each retention class out of the set indicating a measure for retaining the assigned data segment in the storage unit, the assignment of the retention class to a data segment being made dependent on the relevance class assigned to the data segment;
the capacity managing unit is configured to select the at least one data segment for one of a deletion thereof or a deletion of a copy thereof dependent on the retention class assigned;
the capacity managing unit is configured to select the at least one data segment for deletion if the corresponding data segment is within a number of n data segments showing the lowest retention classes assigned;
the capacity managing unit is configured to manage the storage of data segments in a variable number of copies;
the capacity managing unit is configured to select the at least one data segment for deleting at least one copy thereof if the corresponding retention class is below a retention threshold; and
the capacity managing unit is configured to determine the retention class to assign to a data segment in addition dependent on at least one of: an age of the data segment; access frequency information for the data segment or for the relevance class the data segment is assigned to; a persistence index; a storage capacity available in at least one of the other storage tiers in case of a tiered storage unit.

According to a preferred embodiment of the storage system, however, also as an aspect independent from the previously introduced embodiments of the storage system and the corresponding storage unit, a storage capacity manager is introduced. Because of the finite capacity of the storage unit and a foreseen large amount of data segments steadily created within a big data system, it will likely become necessary to discard obsolete data segments and/or to judiciously increase the storage capacity of the storage unit. The storage capacity manager preferably has the main functionality of avoiding a storage unit capacity overflow by suggesting deleting the least relevant data segments from the storage unit, and/or by reducing a redundancy of data segments, i.e. deleting at least one copy of at least one data segment, and in particular by deleting at least one copy of at least one data segment belonging to a certain relevance class, and/or by providing recommendations to a system administrator for a capacity extension of the storage unit. For instance, the stored data segments approaching an available capacity of the storage unit can be considered as a criterion of a utilization of the storage unit being fulfilled for initiating action, the fulfilling of which criterion is monitored by a monitoring unit. Whenever this criterion is fulfilled, and in particular if new storage capacity cannot be made available, a capacity managing unit of the storage capacity manager can select at least one data segment stored in the storage unit and can: suggest the selected data segments or copies thereof for removal, i.e. erasure from the storage unit; delete the selected at least one data segment or copy thereof; or delete the selected at least one data segment or copy thereof after having suggested the deletion to a user or to an administrator and after having received a confirmation for doing so.

The storage capacity manager can act on an individual storage unit such as an HDD, a tape, or an SSD, and as such detached from the previously described multi-tiered storage unit. However, in case of the storage unit including multiple storage tiers, the storage capacity manager can act on each storage tier individually or on the storage unit as a whole. Hence, the utilization of the storage unit fulfilling a criterion such as falling below a capacity threshold, and therefore indicating a shortage of storage capacity in the storage unit, can refer to an individual tier of the storage unit or to the overall storage unit. Hence, in one embodiment, it can suffice that the monitoring unit detects an individual storage tier falling short of free capacity, thereby triggering a selection process for finally suggesting and/or deleting selected data segments in this specific storage tier. In another embodiment, the criterion can be set such that the total capacity of the storage unit including the multiple storage tiers is compared to a capacity threshold and initiates the selection process. In yet another embodiment, the detection of the storage capacity of an individual storage tier falling below a capacity threshold can trigger a selection of data segments out of the entire storage unit, not limited to the data segments stored in the storage tier that falls short of free capacity. It is noted that in the case of a tiered storage unit, thresholds indicating a shortage of free capacity can be set differently for different storage tiers.

The monitoring unit for monitoring the fulfillment of the criterion related to the storage capacity of the storage unit or of a part of the storage unit can be embodied as hardware or software or a combination of both. The utilization of the storage unit can in one embodiment be represented by the still available storage capacity of the respective unit or of an individual storage tier, or by the utilized, i.e. occupied and/or reserved, storage capacity of the storage unit or of an individual storage tier. Preferably, the criterion indicates a shortage of still available storage capacity in the respective storage tier or unit. In another embodiment, the criterion can be a rate at which new data segments are stored in the storage tiers or in the storage unit as a whole.
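A software embodiment of the monitoring criterion could look like the following sketch. It assumes that utilization is represented by the still available (free) capacity and that the criterion is fulfilled when this free capacity falls below a threshold, either per tier or for the storage unit as a whole. The function names, tier names, and capacity figures are hypothetical.

```python
def tiers_short_of_capacity(free_gb_by_tier, threshold_gb_by_tier):
    """Return the tiers whose free capacity is below their own threshold.

    Thresholds may differ per tier, as noted for tiered storage units.
    """
    return [tier for tier, free_gb in free_gb_by_tier.items()
            if free_gb < threshold_gb_by_tier[tier]]


def unit_short_of_capacity(free_gb_by_tier, unit_threshold_gb):
    """Check the criterion for the storage unit as a whole."""
    return sum(free_gb_by_tier.values()) < unit_threshold_gb


free = {"ssd": 50, "hdd": 900, "tape": 5000}
thresholds = {"ssd": 100, "hdd": 500, "tape": 1000}

print(tiers_short_of_capacity(free, thresholds))  # only the SSD tier is short
print(unit_short_of_capacity(free, 6000))
```

Either check can be used to trigger the selection process of the capacity managing unit, restricted to the affected tier or applied across the entire storage unit.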

The capacity managing unit can be embodied as hardware or software or a combination of both and be implemented together with the monitoring unit in a dedicated processing unit. The capacity managing unit preferably is configured to select at least one data segment or copy thereof that can be considered as more suitable for erasure than others. Accordingly, the selection is made dependent on at least a relevance metric indicating the value of each data segment, i.e. the relevance classes introduced before. It is preferred that the data segments with the lowest relevance metric, i.e. the lowest relevance class indicating the lowest relevance of the corresponding data segment, be suggested for erasure or at least for erasure of copies thereof. In one embodiment, the capacity managing unit takes a class-wise approach and suggests the data segments belonging to a common relevance class for erasure without differentiating between the data segments within such relevance class. In a different embodiment, however, the capacity managing unit takes an individual approach to data segments and can even differentiate between importance values of data segments with a common relevance class, e.g. by means of further evaluation of the content of the data segments, or by means of applying additional information available for the data segments.

The selection can be performed dependent on additional parameters, such as at least one of an access frequency of the subject data segments, an age of the data segments, a persistence metric assigned to the data segments, an obsolescence of data segments, etc.

In a preferred embodiment, the following metric is introduced for the storage capacity manager to determine which data segments are selected for further action, e.g. for deletion, suggestion for deletion, or a reduction or suggestion for reduction of redundancy: R(dl)=fR(ir(ck), i′p(c′j), ia(dl), is(dl)) or, in a more specific embodiment:

R(dl)=fR(γir(ck)+δi′p(c′j)+ηia(dl)+κis(dl)),  (7)

where ir(ck) is the relevance index of the relevance class ck to which data segment dl belongs, i.e., dl ∈ ck, where the index l denotes the data segment number; i′p(c′j) is the popularity index of the popularity class c′j to which data segment dl belongs, i.e., dl ∈ c′j; ia(dl) is an age of the data segment dl; and is(dl) is a persistence of the data segment dl.

In general, the popularity class is different from the relevance class, and the popularity class of a data segment can vary with time. The popularity class can be defined as in connection with the storage system described above, and its determination can be supported by an access pattern evaluator such as described above. The relevance class can be determined by means of a classification unit such as described above, and can be stored as metadata together with the data segment in the storage unit. The age of the data segment can denote the time for which the data segment has resided in the storage unit. The persistence of a data segment can in one embodiment be defined by a user or an administrator of the storage unit and specifically can take a value in the interval [0, 1], where persistence level 1 means “never delete”, and 0 means “obsolete data”.

In a preferred embodiment, the storage capacity manager applies the following rule:
If R(dl)<thr1 then delete data segment dl;
If thr1<R(dl)<thr2 then reduce a redundancy of data segment dl;
If thr2<R(dl) then keep data segment dl unmodified.
Note that the term κis(dl) alone must be able to exceed thr1 to avoid unintended deletion.
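The retention metric of expression (7) and the threshold rule can be sketched in Python as follows. The sketch assumes that fR is the identity, that all four indices are normalized to [0, 1], and that the age weight η is negative so that older segments score lower; none of this is fixed by the text. The weights and thresholds are hypothetical example values. Because the rule uses strict inequalities and leaves R(dl) equal to a threshold undefined, the sketch assigns those boundary cases to the less destructive action.

```python
def retention_metric(ir, ip, ia, i_s,
                     gamma=0.4, delta=0.2, eta=-0.1, kappa=0.5):
    """R(dl) = gamma*ir + delta*i'p + eta*ia + kappa*is, with fR as identity.

    Assumption: eta < 0, so a larger age index lowers the metric.
    """
    return gamma * ir + delta * ip + eta * ia + kappa * i_s


def capacity_action(r, thr1=0.3, thr2=0.6):
    """Apply the threshold rule; ties go to the less destructive action."""
    if r < thr1:
        return "delete"
    if r < thr2:
        return "reduce_redundancy"
    return "keep"


# A segment marked "never delete" (persistence 1) is not deleted even if it
# is old, irrelevant and unpopular, because kappa + eta exceeds thr1 here.
print(capacity_action(retention_metric(ir=0.0, ip=0.0, ia=1.0, i_s=1.0)))
```

With a negative age weight, the condition on the persistence term has to hold for κ+η rather than for κ alone, which is why the example weights satisfy 0.5 - 0.1 > 0.3.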

It is preferred that at least one copy of a data segment is suggested for removal first before removing the data segment as such, i.e. all copies thereof. Instead of or in addition to a suggestion or a removal of data segments or copies thereof, or in case the storage capacity manager determines that all existing data segments in the storage unit are still important, a recommendation can be made by the storage capacity manager to a user or an administrator to expand the storage capacity of the storage unit, or at least of a tier of the storage unit in case of a multiple tier storage unit.

An automatic recommendation for a storage capacity expansion can be based on one or more of: a computation of a capacity required to extend the present storage capacity by x %, or to serve storage requirements for the next y months, based on a historic capacity growth rate; a determination of a storage device mix based on at least one of storage unit needs, a current storage tier utilization, a historic capacity growth rate per storage tier, etc.
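The two capacity computations named above can be sketched as follows. The sketch assumes that the historic capacity growth rate is estimated as the average monthly increase over equally spaced monthly usage samples and that growth continues linearly; the text does not prescribe a growth model. Function names and figures are hypothetical.

```python
def capacity_for_percent_extension(current_capacity_tb, x_percent):
    """Additional capacity needed to extend the present capacity by x %."""
    return current_capacity_tb * x_percent / 100.0


def capacity_for_months(used_history_tb, current_capacity_tb, months):
    """Additional capacity needed to serve the next `months` months.

    used_history_tb: monthly samples of used capacity, oldest first
    (at least two samples). Growth is extrapolated linearly.
    """
    monthly_growth = ((used_history_tb[-1] - used_history_tb[0])
                      / (len(used_history_tb) - 1))
    projected_use = used_history_tb[-1] + monthly_growth * months
    return max(0.0, projected_use - current_capacity_tb)


print(capacity_for_percent_extension(200.0, 25))               # 50.0
print(capacity_for_months([100.0, 110.0, 120.0, 130.0], 150.0, 6))  # 40.0
```

Running the same computation per storage tier, with per-tier usage histories, gives one possible basis for the storage device mix mentioned above.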

A storage system as suggested in various embodiments addresses the content of the data segments to be stored and preferably classifies the data segments in real-time. Preferably, each data segment to be stored is associated with a relevance index reflecting the assigned relevance class and a popularity index reflecting the access frequency for data segments of the same relevance class in the data storage. Based on at least this two-fold information, the storage system allows a fully automatic selection of an appropriate level of protection for each data segment, and a fully automatic selection of a storage tier in which a certain data segment is to be initially stored, all without human intervention.

A heterogeneous storage infrastructure, including e.g. solid-state drives, hard-disk drives, and tape systems, can efficiently be used. Performance, reliability, security, and storage efficiency at low operating cost and power consumption are achieved by evaluating the importance of the stored information for the purpose of, e.g., unequal data protection, intelligent tiering, and eventually erasure of obsolete data.

As explained in the previous sections, in embodiments of the storage system different levels of protection are granted to data segments to be stored, depending on the relevance of the information contained. In one embodiment, it is assumed that data segments received for storage are classified by a classifier into one out of K+1 relevance classes, depending on their information content. Preferably, data segments with poor information content due to, e.g., calibration procedures or presence of interference, are assigned to Class 0, and preferably are discarded or stored at the lowest possible cost. Data segments in the remaining K classes are input to K different block encoders for error correcting codes. Each encoder can be characterized by parameters ni and ki, where ki is the number of data symbols being encoded, and ni is the total number of code symbols in the encoded block. Specifically, a multi-tiered storage system with seven relevance classes is considered, where data segments are assigned to the various relevance classes according to a binomial distribution with parameter p. Again, the data segments assigned to Class 0 are assumed to be irrelevant. The data segments in Classes 1 to 6 are then encoded with an RS (64,ki) code over GF(2^8), where the code length n is held at a constant value equal to 64, whereas the number of data symbols is given by ki=64-4i, i.e., the number of data symbols decreases from k1=60 to k6=40. The redundancy thus increases from 4 symbols within a codeword for Class 1 to 24 symbols for Class 6. To assess a gain in storage efficiency that is obtained by the assumed storage system, consider an application where the data segments correspond to images with 100×100 pixels. Data segments might be assigned to Class 0 and discarded if collected, e.g., during calibration of experiments or in the presence of interference. For a random channel bit-error probability of 10^-3, the six classes define sequences of images where on average one pixel is in error every 1, 10^2, 10^5, 10^8, 10^11, 10^14 images after retrieval, respectively. The efficiency gain obtained by the considered system with unequal error protection and binomial class probability distribution over a system that adopts RS encoding by a (64,40) code over GF(2^8) for all data segments is given in percent by

$g = \left( \frac{64}{40} \cdot \frac{1}{\sum\limits_{k=1}^{6} \binom{6}{k} p^{k} (1-p)^{6-k} \frac{64}{64-4k}} - 1 \right) \times 100$
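The efficiency gain can be evaluated numerically with the short Python sketch below. It uses the code parameters stated in the text, i.e. code length 64 and ki=64-4i data symbols for Class i, so that Class 6 coincides with the (64,40) baseline code; data segments in Class 0 are discarded and contribute no storage. The function name and its parameters are illustrative only, and the function is defined for 0 < p ≤ 1.

```python
from math import comb


def efficiency_gain_percent(p, n=64, num_classes=6, step=4):
    """Storage efficiency gain (in percent) of unequal error protection over
    a uniform RS (64,40) code, for binomial class probabilities with
    parameter p (0 < p <= 1).

    Class k (k = 1..num_classes) uses an RS (n, n - step*k) code;
    Class 0 is discarded and therefore excluded from the sum.
    """
    k_baseline = n - step * num_classes  # 40 data symbols in the baseline code
    expected_expansion = sum(
        comb(num_classes, k) * p**k * (1 - p)**(num_classes - k)
        * n / (n - step * k)
        for k in range(1, num_classes + 1)
    )
    return ((n / k_baseline) / expected_expansion - 1) * 100


for p in (0.25, 0.5, 0.75, 1.0):
    print(p, round(efficiency_gain_percent(p), 1))
```

For p=1 all data segments fall into Class 6, both systems use the same code, and the gain is zero; for smaller p more segments receive weaker codes or are discarded, so the gain grows.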

A storage system as introduced, which can also be referred to as a cognitive data storage system, preferably is applied for big data applications. In such a storage system, information can be efficiently extracted, stored, retrieved, and managed. Preferably, in a first step, online detection and classification techniques are applied on incoming data segments. In this step, the occurrence of events that are associated with valuable information is preferably detected and classified. Preferably, in a second step, the result of the classification procedure together with information about the access patterns of similarly classified data is used to determine with which level of protection against errors, and within which tier of the storage system, the incoming data segments are to be initially stored.

For instance, this cognitive approach could be useful for application in an existing system (such as LOFAR) or in the future Square Kilometer Array (SKA) telescope system. In particular, it can be applied to optimize future data access performance. Various workload characteristics can be evaluated for data placement optimization, such as sequentiality and frequency of subsequent data accesses. Based on this information, the appropriate tier for storing the data can be determined. Moreover, predictions regarding subsequent data accesses can enable effective caching and pre-fetching strategies.

In the specific embodiment of the Square Kilometer Array, the functions of the classification unit can be performed by an enhanced version of a Science Data Processor (“SDP”). The SDP preferably has the task to automatically calibrate and image the incoming data, from which science oriented generic catalogues can be automatically formed prior to the archiving of images that are represented by the incoming data segments. Note that an event detector/classifier pair in the classification unit can face the challenging task of determining in real time a set of features related to a detected event, for example real-time detection and machine-learned classification of variable stars from time-series data. In this case, the detection of variable stars using the least squares fitting of sinusoids with a floating mean and over a range of test frequencies, followed by tree-based classification of the detected stars, can in one embodiment be well suited for online implementation. Within the current SKA architecture, the functions of a Multi-Tier Storage (“MTS”) system preferably are performed by an enhanced version of a Science Data Archive Facility.

For applications within the healthcare industry, the functions of the classification unit preferably depend on the context of the data being stored. For example, if data segments being collected are used for a cohort study, the parts that are relevant to the study can be classified as more important than other data. In the context of personalized medicine, medical records can be identified by their type, e.g., biochemistry, hematology, genomics, hospital records. Within each type, relevant features can be classified and associated with a certain level of importance.

FIG. 1 shows a block diagram of a storage system according to an embodiment of the present invention. The storage system includes two main subsystems, i.e. a classification unit 1, also referred to as Real Time Processing & Classification (“RTPC”), and a storage unit 2, also referred to as MTS. In the classification unit 1, an incoming data stream containing data segments is processed by a real-time processing unit 15, typically to perform at least one of: a filtering operation; suppression of spurious data segments, e.g., removing interference in the context of an astronomical data application; ensuring privacy of medical records by pseudonymization in the context of cohort studies in the healthcare industry; or extracting relevant information from medical records in the context of personalized medicine. An output of the real-time processing unit 15 is presented to a set of N online event detectors 11. Each of the N event detectors 11 determines whether the occurrence of an event, which can be associated with predefined information, is detected within a segment of the incoming data stream. Each event detector 11 of the set can be configured to detect a specific event that is different from the events the other event detectors 11 of the set are expected to detect.

In general, real-time classification can refer to any initial data evaluation that can take place while guaranteeing a predetermined sustained rate of the incoming data stream. Whenever a relevant event is detected by one of the N event detectors 11, an associated online classifier 12 assigns the data segment, which contains the information related to the event, to one of K+1 relevance classes with K≧0 depending, e.g., on the presence or absence of features that characterize the event. Data segments in which event-related information is not detected are assigned by default to a Class 0. Note that a set of N buffers 13 is included in the data paths to compensate for delays introduced by the associated event detector 11 and classifier 12. Also note that in a preferred embodiment, several pairs of event detectors 11 and classifiers 12 can be operating in parallel, if events of different nature are deemed relevant, as illustrated in FIG. 1 for N detector 11/classifier 12 pairs, with N=3. In this case, a data segment can be associated with multiple tags assigned by the various classifiers 12, and a buffer 16 in the main data path to the storage unit 2 preferably is dimensioned to accommodate the largest delay expected to be introduced by the classifiers.
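The detector/classifier pairing described above can be sketched as follows. This is an illustrative sketch only: the `TaggedSegment` type, the `classify` function, and the rule that the overall relevance class is the highest class among the tags are assumptions, since the text states only that a segment can carry multiple tags and that segments without a detected event default to Class 0.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class TaggedSegment:
    payload: bytes
    # One (event name, relevance class) tag per detector that fired.
    tags: List[Tuple[str, int]] = field(default_factory=list)

    @property
    def relevance_class(self) -> int:
        # Assumption: the highest tagged class wins; no tags means Class 0.
        return max((cls for _, cls in self.tags), default=0)

Detector = Callable[[bytes], bool]
Classifier = Callable[[bytes], int]

def classify(payload: bytes,
             pairs: Dict[str, Tuple[Detector, Classifier]]) -> TaggedSegment:
    """Run every detector; when one fires, its paired classifier tags the segment."""
    segment = TaggedSegment(payload)
    for name, (detect, classify_event) in pairs.items():
        if detect(payload):
            segment.tags.append((name, classify_event(payload)))
    return segment
```

In a real-time setting the pairs would run in parallel with buffering, as the text describes; the sequential loop here only shows the tagging logic.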

In other embodiments, only the occurrence of a single event is desired to be detected, in which case the event detector at the same time acts as classifier, or the classifier acts as event detector. In a different embodiment, the occurrence of an event can be known a priori and only event features are desired to be identified, in which case the event detector/s 11 is/are not needed.

The data segments and the related class information, which can be subsumed under the data segment metadata, are received by a selector 14, which has the task of determining with which level of protection and in which tier each received data segment is to be stored. This decision depends on the information on the relevance class and on an access pattern to this relevance class, which is obtained from an access pattern evaluator 24 assigned to the storage unit 2.

The storage unit 2 receives from the classification unit 1 the processed sequence of incoming data segments to be stored in a multi-tier storage 21 containing M storage tiers, together with the individual information for each data segment about detected events, identified relevance classes, and possibly other features. This information preferably is utilized to assign a protection level for the respective data segment and an initial placement in one of the available storage tiers 21. An M-tier storage system with J data segment protection levels, with M=3 and J=3, is illustrated for example in FIG. 1. The three storage tiers 21 might correspond to different types of storage media, e.g., SSDs, HDDs, and tape. For performance optimization during normal system operation, frequently accessed data or randomly accessed data can preferably be placed on SSDs, whereas less frequently accessed data or sequentially accessed data can be stored on HDDs or on tape.

Prior to being stored on the physical media corresponding to the selected storage tier 21, each data segment is presented to an encoder 23, which provides different levels of protection, for example using unequal error protection (“UEP”), depending on the relevance class information. In an embodiment, compression and/or deduplication of the data segments can be considered in addition to UEP. A data segment with a high relevance class preferably is associated with a high value. Its information content is such that a loss would be associated with a high cost, and therefore the data segment is protected with a higher level of redundancy. The required level of redundancy can be provided by error correction coding or erasure coding, by storing replicas of the data segments, or by a combination of these techniques.

An access pattern evaluator 24 of the storage unit 2 provides additional information about the popularity of the data segments associated with a certain relevance class. Every time a data segment is accessed in the storage unit 2, the associated metadata information including the class information is provided to the access pattern evaluator, which learns about the popularity of the information content in the data segments from the way they are being accessed. Access patterns can be found at various levels, e.g., activity during various times of a day, sequence of reads and writes, access sequentiality, and number of users retrieving the data. This information is used to further classify data segments into one of several popularity classes.
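A very reduced sketch of such an evaluator is given below. It tracks only the number of accesses per relevance class, whereas the text also mentions time of day, read/write sequences, sequentiality, and user counts. The class name, the "hot"/"warm"/"cold" labels, and both thresholds are assumptions for illustration.

```python
from collections import Counter

class AccessPatternEvaluator:
    """Counts accesses per relevance class and derives a popularity class."""

    def __init__(self, hot_threshold: int = 10, warm_threshold: int = 3):
        # Thresholds are illustrative assumptions, not values from the text.
        self.hot_threshold = hot_threshold
        self.warm_threshold = warm_threshold
        self.accesses = Counter()

    def record_access(self, metadata: dict) -> None:
        """Called on every access with the segment's metadata (incl. class)."""
        self.accesses[metadata["relevance_class"]] += 1

    def popularity_class(self, relevance_class: int) -> str:
        count = self.accesses[relevance_class]
        if count >= self.hot_threshold:
            return "hot"
        if count >= self.warm_threshold:
            return "warm"
        return "cold"
```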

Subsequently, the access pattern evaluator 24 sends information to the selector 14 in the classification unit 1, which accordingly updates a metric for the initial decision on the level of protection and the storage tier of individual data segments. The selector 14 thus updates the criterion for initial data placement based on both data relevance classification and data access statistics. In this manner, a data segment that belongs to a certain relevance class is written to the storage medium, and protected against errors with the redundancy level, that are most appropriate at a particular point in time. Following an initial data placement, the access pattern evaluator 24 monitors all data segments in the storage tiers 21 and places each in the appropriate popularity class.

In the present embodiment of FIG. 1, a migrator 25 is provided, also referred to as storage relocation manager, which is preferably arranged in the storage unit 2. The migrator 25 receives information from the access pattern evaluator 24, and as such receives access frequencies to the individual classes. This information enables the migrator 25 to place the stored data segments in the right storage tier 21 to enhance access performance. Specifically, the migrator 25 has the task of moving data segments stored in one storage tier 21 to another storage tier 21 if such movement is indicated by the present access patterns to such data segments, and specifically by the access patterns of the class to which the respective data segment belongs.

In the present embodiment of FIG. 1, storage efficiency and access performance are further optimized by continuously monitoring the access patterns and updating the criteria for data segment placement in a storage capacity manager 26. When the amount of data segments stored in the storage unit approaches a system capacity, an automatic selection of data segments for deletion or deletion of copies thereof is provided by the storage capacity manager 26. The selected data segments and/or copies thereof can be deleted and/or can be suggested for deletion and/or a system capacity expansion is requested, all preferably based on the importance and the access patterns of the data segments stored in the system.

FIG. 2 illustrates a flowchart of a method for storing a data segment in a storage tier of a storage unit including at least two storage tiers, according to an embodiment of the present invention. In step 21, a data segment is received for storage. In step 22, at least one out of a set of at least two relevance classes is assigned to the data segment dependent on information included in the data segment. In step 23, a level of protection is determined for the classified data segment dependent on the relevance class assigned. In step 24, a storage tier in which to store the classified data segment is determined out of the at least two storage tiers, dependent on at least access frequency information received for data segments in the same relevance class, and dependent on the characteristics of the storage tiers available. In step 25, a redundancy level is determined for the data segment dependent on the determined protection level and the determined storage tier. In step 26, the classified data segment including the assigned relevance class is stored in the determined storage tier, and copies of the data segment are stored in the determined storage tier or in a different storage tier according to the determined redundancy level.
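The flow of FIG. 2 can be sketched in a few lines of Python. Everything concrete here is an assumption chosen for illustration: the tier names, the frequency cut-offs of 100 and 10 accesses, the cap of three protection levels, and the policy of placing extra copies on tape. The text only fixes the order of the steps and their dependencies.

```python
from typing import Callable, Dict, List, Tuple

TIERS = ("ssd", "hdd", "tape")  # assumed tier names, fastest first

def store_segment(payload: bytes,
                  classify: Callable[[bytes], int],
                  class_access_freq: Dict[int, float],
                  storage: Dict[str, List[Tuple[int, bytes]]]) -> List[str]:
    """Store one received segment (step 21); return the tier of every copy."""
    relevance = classify(payload)                   # step 22: relevance class
    protection = min(relevance, 2)                  # step 23: level 0..2 (assumed)
    freq = class_access_freq.get(relevance, 0.0)    # step 24: tier from class frequency
    if freq >= 100.0:
        tier = "ssd"
    elif freq >= 10.0:
        tier = "hdd"
    else:
        tier = "tape"
    # Step 25: redundancy from protection level and tier. Assumed policy:
    # one primary copy in the chosen tier, `protection` extra copies on tape.
    placement = [tier] + ["tape"] * protection
    for target in placement:                        # step 26: write segment and copies
        storage[target].append((relevance, payload))
    return placement
```

Storing the relevance class alongside the payload mirrors step 26, in which the classified segment is stored "including the assigned relevance class".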

FIG. 3 illustrates a flowchart of a method for relocating a data segment presently stored in a storage tier of a storage unit with at least two storage tiers, according to an embodiment of the present invention. In step 31, an access frequency to a relevance class is monitored. In step 32, it is verified whether a change Δaf in the access frequency to data segments stored in the storage unit and belonging to the subject relevance class exceeds a threshold t1. If not, monitoring of the access frequency continues in step 31. If yes (y), a data segment assigned to the subject relevance class and stored in a present storage tier is moved in step 33 to a different storage tier, i.e. a target storage tier. In step 34, protection measures are adapted for this relocated data segment in order to achieve the protection level assigned to the data segment.
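A minimal sketch of this relocation loop body follows. The helper names `pick_tier` and `copies_needed`, the frequency cut-offs, and the per-tier copy counts are hypothetical; the text specifies only the threshold test on Δaf, the move to a target tier, and the subsequent adaptation of protection measures.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StoredSegment:
    name: str
    relevance_class: int
    protection_level: int
    tier: str
    copies: int = 1

def pick_tier(access_freq: float) -> str:
    """Assumed mapping from class access frequency to a target tier."""
    if access_freq >= 100.0:
        return "ssd"
    if access_freq >= 10.0:
        return "hdd"
    return "tape"

def copies_needed(protection_level: int, tier: str) -> int:
    """Hypothetical rule: tape is assumed to need one copy fewer than disk."""
    if tier == "tape":
        return max(1, protection_level)
    return protection_level + 1

def relocate_class(segments: List[StoredSegment], relevance_class: int,
                   previous_freq: float, current_freq: float,
                   t1: float) -> List[str]:
    """Relocate segments of one class if |Δaf| exceeds t1; return moved names."""
    if abs(current_freq - previous_freq) <= t1:      # threshold test on Δaf
        return []                                    # keep monitoring
    target = pick_tier(current_freq)
    moved = []
    for seg in segments:
        if seg.relevance_class == relevance_class and seg.tier != target:
            seg.tier = target                        # move to the target tier
            # Adapt protection so the assigned level is still achieved.
            seg.copies = copies_needed(seg.protection_level, target)
            moved.append(seg.name)
    return moved
```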

FIG. 4 illustrates a method for managing a capacity of a storage unit for storing data segments, according to an embodiment of the present invention, preferably executed by a storage capacity manager according to an embodiment of the present invention. In step 41, a remaining capacity C of the storage unit is monitored. In step 42, it is verified whether the remaining storage capacity C is less than a threshold thr1. If not, monitoring of the remaining capacity C continues. If yes (y), data segments are deleted in step 43, thereby reducing the occupied storage space in the storage unit. Step 43 includes the selection of data segments to be deleted. In the present example, the data segments are selected dependent on a relevance class assigned, and the data segments assigned to the lowest relevance class are removed. In step 44, it is verified whether the remaining storage capacity C is less than a threshold thr2, which preferably is less than the threshold thr1. If not, monitoring of the remaining capacity C continues. If yes (y), a storage capacity expansion is recommended to an administrator of the storage unit.
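One pass through the flow of FIG. 4 might look like the sketch below. The tuple layout of a segment and the function name `manage_capacity` are assumptions; the two thresholds and the rule of removing the lowest relevance class follow the description.

```python
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (name, size, relevance_class), assumed layout

def manage_capacity(segments: List[Segment], total_capacity: int,
                    thr1: int, thr2: int) -> Tuple[List[Segment], bool]:
    """Return (segments kept, whether a capacity expansion is recommended)."""
    def remaining(segs: List[Segment]) -> int:
        return total_capacity - sum(size for _, size, _ in segs)

    if remaining(segments) >= thr1:              # step 42: enough space left
        return segments, False
    # Step 43: select and delete all segments of the lowest relevance class.
    lowest = min((rel for _, _, rel in segments), default=None)
    kept = [seg for seg in segments if seg[2] != lowest]
    # Step 44: still below the second, lower threshold -> recommend expansion.
    return kept, remaining(kept) < thr2
```

With thr2 chosen below thr1, an expansion is recommended only when deleting the least relevant data has not freed enough space.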

According to an embodiment of the present invention, a computer program product is provided including a computer readable medium having computer readable program code embodied therewith, the computer readable program code including computer readable program code configured to perform a method according to any one of the preceding embodiments.

The present invention can be a system, a method, and/or a computer program product. The computer program product can include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium can be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network can include copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program instructions for carrying out operations of the present invention can be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of at least one programming language, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions can execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer can be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection can be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) can execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions can be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions can also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein includes an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams can represent a module, segment, or portion of instructions, which includes at least one executable instruction for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block can occur out of the order noted in the figures. For example, two blocks shown in succession can, in fact, be executed substantially concurrently, or the blocks can sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

What is claimed is:
 1. A storage system, comprising: a storage unit including at least two storage tiers; an access pattern evaluator configured to provide information about a frequency at which data segments stored in at least one of the at least two storage tiers are accessed; a classification unit configured to assign at least one out of a set of at least two relevance classes to a data segment received for storing in the storage unit dependent on information included in the data segment; a selector configured to determine a storage tier out of the at least two storage tiers for storing the classified data segment to, dependent on at least access frequency information provided by the access pattern evaluator for data segments in the same relevance class, and configured to determine a level of protection for the classified data segment dependent on at least the relevance class assigned; and logic for storing the classified data segment including the assigned relevance class to the determined storage tier and according to the determined level of protection.
 2. The storage system according to claim 1, wherein the classification unit comprises a set of event detectors with each event detector of the set being configured to detect a different event in the data segment received for storing; and wherein the classification unit comprises a set of classifiers, each classifier being assigned to a different one of the event detectors, and each classifier of the set being configured to assign at least one relevance class pre-selected from the set of relevance classes for the event to be detected by the assigned event detector.
 3. The storage system according to claim 1, comprising a real-time data processing unit for real-time processing of input data segments and providing a sequence of data segments for storing.
 4. The storage system according to claim 1, wherein the selector is configured to determine the storage tier for the classified data segment dependent on the access frequency information provided by the access pattern evaluator for data segments in the same relevance class, and dependent on the relevance class assigned; and wherein the selector is configured to determine the level of protection for the classified data segment dependent on the relevance class assigned, and dependent on the access frequency information provided by the access pattern evaluator for data segments in the same relevance class.
 5. The storage system according to claim 1, wherein the selector is configured to determine a redundancy level for the classified data segment dependent on the determined protection level and the determined storage tier, the redundancy level specifying a number of copies of the classified data segment to be stored in the storage unit, and in particular specifying a number of copies of the classified data segment to be stored in which storage tier of the storage unit; and wherein the logic is configured to store copies of the classified data segment in the at least one storage tier according to the determined level of redundancy.
 6. The storage system according to claim 1, wherein the selector is configured to determine at least one of an error correction code and an erasure code to be applied to the classified data segment dependent on the determined protection level and dependent on the determined storage tier; and wherein the logic is configured to store the classified data segment with at least one of the determined error correction code and the determined erasure code in the storage unit.
 7. Method for storing a data segment in a storage tier of a storage unit including at least two storage tiers, comprising: assigning at least one out of a set of at least two relevance classes to the data segment dependent on information included in the data segment; receiving information about a frequency at which data segments stored in at least one of the at least two storage tiers are accessed; determining a storage tier out of the at least two storage tiers for storing the classified data segment to dependent on at least access frequency information received for data segments in the same relevance class; determining a level of protection for the classified data segment dependent on at least the relevance class assigned; and storing the classified data segment including the assigned relevance class to the determined storage tier and according to the determined level of protection.
 8. A storage tier relocation manager for relocating a data segment presently stored in a storage tier of a storage unit with at least two storage tiers, the data segment having assigned a protection level out of a set of protection levels, the storage tier relocation manager being configured to: determine a target storage tier for the data segment dependent on access frequency information received for one or more of the data segment or a relevance class the data segment is assigned to; relocate the data segment to the target storage tier if the target storage tier is different from the present storage tier; and in case of a relocation, apply a protection measure suitable for achieving the assigned protection level.
 9. The storage tier relocation manager according to claim 8, being configured to in case of a relocation apply the protection measure by selecting at least one of: a redundancy level specifying a number of copies of the data segment to be stored in at least one of the target storage tier and a non-target storage tier; an error correction code to be applied to the data segment; and an erasure code to be applied to the data segment; and store the data segment in the storage unit according to the selected at least one of the redundancy level, the error correction code and the erasure code.
 10. Method for relocating a data segment presently stored in a storage tier of a storage unit with at least two storage tiers, the data segment having assigned a protection level out of a set of protection levels, comprising: determining a target storage tier for the data segment dependent on access frequency information received for at least one of the data segment or a relevance class the data segment is assigned to; relocating the data segment to the target storage tier if the target storage tier is different from the present storage tier; and applying a protection measure suitable for achieving the assigned protection level in case of a relocation.
 11. A storage capacity manager for a storage system comprising a storage unit for storing data segments having at least one relevance class assigned per data segment, the storage capacity manager comprising: a monitoring unit for determining if a utilization of the storage unit fulfils a criterion; and a capacity managing unit for, in response to the utilization of the storage unit fulfilling the criterion, selecting at least one data segment stored in the storage unit for one of a deletion thereof or a deletion of a copy thereof at least dependent on the at least one relevance class assigned.
 12. The storage capacity manager according to claim 11, wherein the storage capacity manager is configured to manage a storage unit comprising at least two storage tiers; wherein the monitoring unit is configured to determine if the utilization of at least one of the at least two storage tiers fulfils the criterion; and wherein the capacity managing unit is configured to, in response to the utilization of at least one storage tier fulfilling the criterion, select the at least one data segment stored in this storage tier for one of a deletion thereof or a deletion of a copy thereof in the same storage tier or in a different storage tier.
 13. The storage capacity manager according to claim 11, wherein the monitoring unit is configured to determine if the utilization of the storage unit falls below a capacity threshold; and wherein the capacity managing unit is configured to select the at least one data segment in response to the utilization of the storage unit falling below the capacity threshold.
 14. The storage capacity manager according to claim 11, wherein the capacity managing unit is configured to: suggest the at least one selected data segment for deletion or deletion of a copy thereof; suggest the at least one selected data segment for deletion or deletion of a copy thereof and delete the at least one selected data segment or a copy thereof in response to a user confirmation; delete the at least one selected segment or copies thereof; or a combination thereof.
 15. The storage capacity manager according to claim 11, wherein the capacity managing unit is configured to assign a retention class out of a set of retention classes to the data segments in the storage unit, each retention class out of the set indicating a measure for retaining the assigned data segment in the storage unit, the assignment of the retention class to a data segment being made dependent on the relevance class assigned to the data segment; and wherein the capacity managing unit is configured to select the at least one data segment for one of a deletion thereof or a deletion of a copy thereof dependent on the retention class assigned.
 16. The storage capacity manager according to claim 15, wherein the capacity managing unit is configured to select the at least one data segment for deletion if the corresponding data segment is within a number of n data segments showing the lowest retention classes assigned.
 17. The storage capacity manager according to claim 15, wherein the capacity managing unit is configured to manage the storage of data segments in a variable number of copies; and wherein the capacity managing unit is configured to select the at least one data segment for deleting at least one copy thereof if the corresponding retention class is below a retention threshold.
 18. The storage capacity manager according to claim 15, wherein the capacity managing unit is configured to determine the retention class to assign to a data segment in addition dependent on at least one of: an age of the data segment; access frequency information for the data segment or for the relevance class the data segment is assigned to; a persistence index; and a storage capacity available in at least one of the other storage tiers in case of a tiered storage unit.
 19. Method for managing a capacity of a storage unit for storing data segments having at least one relevance class assigned per data segment, the method comprising: determining if a utilization of the storage unit fulfils a criterion; responding to the utilization of the storage unit fulfilling the criterion; and selecting at least one data segment stored in the storage unit for one of a deletion thereof or a deletion of a copy thereof at least dependent on the relevance class assigned.
 20. A non-transitory computer program product comprising a computer readable medium having computer readable program code embodied therewith, wherein the computer readable program code is configured to perform a method according to claim 19.